Natural Language Processing
Text Preprocessing & Feature Creation
= convert raw text into numeric features suitable for ML models
Basic Text Preprocessing
- Lowercasing
- Removing punctuation
- Removing stop words (e.g., 'and', 'the', 'is', 'in')
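- Stemming/Lemmatization: reduces words to a base form. Stemming strips suffixes heuristically and may produce non-words (e.g., 'running' -> 'run', 'studies' -> 'studi'); lemmatization maps each word to its dictionary form (e.g., 'studies' -> 'study')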
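The first three steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the `STOP_WORDS` set and the `preprocess` helper are hypothetical names, and the stop-word list is deliberately tiny. Stemming and lemmatization are left out because they usually rely on a dedicated library.

```python
import string

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"and", "the", "is", "in"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOP_WORDS]

tokens = preprocess("The cat is sleeping in the garden, and the dog is running!")
print(tokens)  # ['cat', 'sleeping', 'garden', 'dog', 'running']
```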
Tokenization
= split a sequence of characters into “tokens”
- word tokenization
- subword tokenization (BPE, WordPiece): used by modern Transformer models
- character tokenization
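A rough sketch of the three granularities. The subword part is only a greedy longest-match lookup in the style of WordPiece inference, run against a made-up vocabulary (`toy_vocab` is an assumption for illustration); real tokenizers learn their vocabulary from a corpus.

```python
import re

text = "Tokenizers split text."

# Word tokenization: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character tokenization: every character (including spaces) is a token.
char_tokens = list(text)

def greedy_subword(word, vocab):
    """Greedy longest-match segmentation; '##' marks a word-internal piece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # no vocabulary entry matches at this position
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Made-up vocabulary, for illustration only.
toy_vocab = {"token", "##izer", "##s", "split"}

print(word_tokens)                               # ['Tokenizers', 'split', 'text', '.']
print(greedy_subword("tokenizers", toy_vocab))   # ['token', '##izer', '##s']
```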
Creating Features from Text Data
= represent text based on the words it contains
Bag-of-Words (BoW)
= represents text as a vector of word counts.
- How it works
- tokenization
- vocabulary building: create a list of all unique tokens
- vector creation: for each document, create a vector where each element is the count of one vocabulary word in that document
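The three steps can be sketched with the standard library alone. The two example documents and the `bow_vector` helper are assumptions for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# 1. Tokenization (whitespace split as a stand-in for a real tokenizer).
tokenized = [doc.split() for doc in docs]

# 2. Vocabulary building: all unique tokens, sorted for a stable column order.
vocab = sorted({tok for doc in tokenized for tok in doc})

# 3. Vector creation: one count per vocabulary word, per document.
def bow_vector(tokens, vocab):
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [bow_vector(tokens, vocab) for tokens in tokenized]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Note that word order is lost: any reordering of a document's words yields the same vector.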
TF–IDF (Term Frequency–Inverse Document Frequency)
= weighted Bag-of-Words, which highlights words that are more significant or informative
- How it works
- Term Frequency (TF): measures how often a term appears in a specific document
- Inverse Document Frequency (IDF): measures how rare a term is across the entire corpus; terms that appear in most documents get a low weight
- The TF-IDF score is the product of the two.
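A small sketch of one common variant, assuming TF is the term count divided by document length and IDF is log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries often use smoothed or normalized variants, so exact values will differ. The function names and the toy corpus are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

def tf(term, doc):
    """Relative frequency of `term` within one document."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(N / df); assumes `term` occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'the' appears in every document, so its IDF (and TF-IDF) is 0.
print(tfidf("the", docs[0], docs))
# 'cat' appears in only one document, so it scores higher than 'sat'.
print(tfidf("cat", docs[0], docs), tfidf("sat", docs[0], docs))
```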
Word embeddings
What are embeddings
= a type of distributed representation that represents words or text as real-valued vectors in a continuous high-dimensional space
- Words with similar meaning appear close in vector space.
- The simplest word embeddings (word2vec, GloVe) map each word to a single fixed vector, typically with a few hundred dimensions (e.g., 300).
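"Close in vector space" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not taken from word2vec, GloVe, or any trained model.

```python
import math

# Made-up toy vectors; real embeddings are learned and much longer.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])

# With these toy values, 'king' is closer to 'queen' than to 'apple'.
print(sim_royal > sim_fruit)  # True
```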
What to use, and when
- context oblivious: assign one vector per word, regardless of context
  - mostly of historical interest for text
  - examples: Latent Semantic Analysis (LSA), word2vec, fastText
  - still used for small datasets and image embeddings
- context sensitive: the vector for a word depends on its surrounding context
  - standard in modern NLP
  - examples: BERT and other Transformer models
  - produce embeddings for entire sentences or tokens
NLP Component Tasks
= core building blocks that many NLP systems rely on
- Part of Speech Tagging (POS tagging)
- label each word in the corpus with its part-of-speech tag (noun, verb, adjective, etc.), based on its context and definition
- Named Entity Recognition (NER)
- identify and classify named entities in text into predefined categories such as person names, organization names, locations, dates, etc.
- Co-reference detection
- identify all the expressions in a text that refer to the same entity
- Parsing
NLP Problem Types

Example: Bag-of-Words Workflow
A classical ML approach:
- Tokenize training text
- Convert to BoW or TF–IDF vectors
- Fit a model (e.g., logistic regression)
- Vectorize test text using the same vocabulary
- Predict labels (Yes/No, categories, sentiment, etc.)
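The workflow above can be sketched end to end with the standard library. Everything here is an assumption for illustration: the tiny sentiment dataset, the helper names, and the hand-rolled logistic regression trained by stochastic gradient descent. In practice a library such as scikit-learn would typically handle the vectorizing and fitting steps.

```python
import math

# Made-up toy data: 1 = positive sentiment, 0 = negative.
train_texts = ["great movie loved it", "wonderful great acting",
               "terrible movie hated it", "awful boring acting"]
train_labels = [1, 1, 0, 0]
test_texts = ["loved the wonderful acting", "boring and terrible"]

def tokenize(text):
    return text.lower().split()

# Vocabulary is built from the training text only.
vocab = sorted({tok for text in train_texts for tok in tokenize(text)})
index = {word: i for i, word in enumerate(vocab)}

def vectorize(text):
    """BoW counts over the training vocabulary; unseen words are ignored."""
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in index:
            vec[index[tok]] += 1.0
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, lr=0.5, epochs=200):
    """Logistic regression trained with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0

w, b = fit([vectorize(t) for t in train_texts], train_labels)
# Test text is vectorized with the same vocabulary as the training text.
predictions = [predict(w, b, vectorize(t)) for t in test_texts]
print(predictions)
```

Reusing the training vocabulary at test time keeps the feature columns aligned with the learned weights; words never seen in training (here 'the' and 'and') simply contribute nothing.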